Quality Assessment of Linked Datasets using Probabilistic Approximation
With the increasing application of Linked Open Data, assessing the quality of
datasets by computing quality metrics becomes an issue of crucial importance.
For large and evolving datasets, an exact, deterministic computation of the
quality metrics is too time consuming or expensive. We employ probabilistic
techniques such as Reservoir Sampling, Bloom Filters and Clustering Coefficient
estimation for implementing a broad set of data quality metrics in an
approximate but sufficiently accurate way. Our implementation is integrated in
the comprehensive data quality assessment framework Luzzu. We evaluated its
performance and accuracy on Linked Open Datasets of broad relevance.
Comment: 15 pages, 2 figures, to appear in ESWC 2015 proceedings
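The probabilistic machinery named in the abstract can be illustrated with a short sketch. Below, a minimal reservoir-sampling (Algorithm R) estimator for a proportion-style quality metric over a stream of RDF triples; the metric, predicate, and parameter names are illustrative assumptions, not Luzzu's actual API:

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Algorithm R: keep a uniform random sample of k items from a
    stream of unknown length using O(k) memory."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)  # inclusive bounds
            if j < k:
                reservoir[j] = item
    return reservoir

# Hypothetical metric: fraction of triples whose object is an HTTP IRI.
def estimate_metric(triples, k=1000,
                    predicate=lambda t: t[2].startswith("http")):
    sample = reservoir_sample(triples, k)
    return sum(predicate(t) for t in sample) / len(sample)
```

Because the reservoir is a uniform sample, the estimated proportion approaches the exact metric value as the sample grows, while memory stays constant regardless of dataset size.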
Sampled Weighted Min-Hashing for Large-Scale Topic Mining
We present Sampled Weighted Min-Hashing (SWMH), a randomized approach to
automatically mine topics from large-scale corpora. SWMH generates multiple
random partitions of the corpus vocabulary based on term co-occurrence and
agglomerates highly overlapping inter-partition cells to produce the mined
topics. While other approaches define a topic as a probabilistic distribution
over a vocabulary, SWMH topics are ordered subsets of such vocabulary.
Interestingly, the topics mined by SWMH underlie themes from the corpus at
different levels of granularity. We extensively evaluate the meaningfulness of
the mined topics both qualitatively and quantitatively on the NIPS (1.7 K
documents), 20 Newsgroups (20 K), Reuters (800 K) and Wikipedia (4 M) corpora.
Additionally, we compare the quality of SWMH with Online LDA topics for
document representation in classification.
Comment: 10 pages, Proceedings of the Mexican Conference on Pattern Recognition 2015
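The core primitive behind SWMH can be sketched with plain (unweighted) MinHash, under which words whose co-occurrence sets yield the same min-hash values fall into the same partition cell; this is a simplified stand-in for the weighted, sampled scheme in the paper, and the salted-hash construction is an assumption for illustration:

```python
import random

def minhash_signature(items, num_hashes=64, seed=0):
    """MinHash: for each of num_hashes salted hash functions, record the
    minimum hash value over the set. The probability that two signatures
    agree at a coordinate equals the Jaccard similarity of the sets."""
    rng = random.Random(seed)
    # Salted built-in hash as a stand-in for a family of hash functions.
    salts = [rng.getrandbits(64) for _ in range(num_hashes)]
    return [min(hash((salt, x)) for x in items) for salt in salts]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of agreeing coordinates estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Grouping vocabulary terms by signature collisions yields the kind of random co-occurrence partitions that SWMH agglomerates into topics.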
Interval Selection in the Streaming Model
A set of intervals is independent when the intervals are pairwise disjoint.
In the interval selection problem we are given a set I of intervals
and we want to find an independent subset of intervals of largest cardinality.
Let α(I) denote the cardinality of an optimal solution. We
discuss the estimation of α(I) in the streaming model, where we
only have one-time, sequential access to the input intervals, the endpoints of
the intervals lie in {1,...,n}, and the amount of memory is
constrained.
For intervals of different sizes, we provide an algorithm in the data stream
model that computes an estimate α̂ of α(I) that, with
probability at least 2/3, satisfies (1/2)(1-ε)α(I) ≤ α̂ ≤ α(I). For same-length
intervals, we provide another algorithm in the data stream model that computes
an estimate α̂ of α(I) that, with probability at
least 2/3, satisfies (2/3)(1-ε)α(I) ≤ α̂ ≤ α(I). The space used by our algorithms
is bounded by a polynomial in 1/ε and log n. We also show that no better
estimations can be achieved using o(n) bits of storage.
We also develop new, approximate solutions to the interval selection problem,
where we want to report a feasible solution, that use O(α(I))
space. Our algorithms for the interval selection problem match the optimal
results by Emek, Halldórsson and Rosén [Space-Constrained Interval
Selection, ICALP 2012], but are much simpler.
Comment: Minor corrections
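For reference, the offline version of interval selection is solved exactly by the classic greedy scan over intervals sorted by right endpoint; the streaming algorithms above approximate the size of this optimum under memory constraints. A minimal sketch of the offline baseline:

```python
def max_independent_intervals(intervals):
    """Exact offline solution: scan intervals in order of right endpoint
    and greedily keep every interval that starts after the last selected
    one ends. Intervals are closed [l, r]; independent means pairwise
    disjoint (no shared point)."""
    selected = []
    last_end = float("-inf")
    for l, r in sorted(intervals, key=lambda iv: iv[1]):
        if l > last_end:  # disjoint from everything selected so far
            selected.append((l, r))
            last_end = r
    return selected
```

The streaming setting forbids this sort, since the intervals arrive once in arbitrary order and cannot all be stored, which is what makes constant-factor estimation in small space the interesting regime.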
Fast Locality-Sensitive Hashing Frameworks for Approximate Near Neighbor Search
The Indyk-Motwani Locality-Sensitive Hashing (LSH) framework (STOC 1998) is a
general technique for constructing a data structure to answer approximate near
neighbor queries by using a distribution H over locality-sensitive
hash functions that partition space. For a collection of n points, after
preprocessing, the query time is dominated by O(n^ρ log n) evaluations
of hash functions from H and O(n^ρ) hash table lookups and
distance computations, where ρ ∈ (0,1) is determined by the
locality-sensitivity properties of H. It follows from a recent
result by Dahlgaard et al. (FOCS 2017) that the number of locality-sensitive
hash functions can be reduced to O(log² n), leaving the query time to be
dominated by O(n^ρ) distance computations and O(n^ρ log n)
additional word-RAM operations. We state this result as a general framework and
provide a simpler analysis showing that the number of lookups and distance
computations closely match the Indyk-Motwani framework, making it a viable
replacement in practice. Using ideas from another locality-sensitive hashing
framework by Andoni and Indyk (SODA 2006) we are able to reduce the number of
additional word-RAM operations to O(n^ρ).
Comment: 15 pages, 3 figures
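As a concrete instantiation of such a framework, below is a minimal random-hyperplane (SimHash) LSH index for angular distance with L tables of k-bit keys; the class name and parameters are illustrative, and this sketches the classic Indyk-Motwani layout rather than the reduced-hash-function constructions discussed above:

```python
import random

class HyperplaneLSH:
    """Random-hyperplane LSH for angular distance: each of L tables
    hashes a vector to the sign pattern of k random Gaussian
    projections, following the Indyk-Motwani table layout."""
    def __init__(self, dim, k=8, L=10, seed=0):
        rng = random.Random(seed)
        self.planes = [[[rng.gauss(0, 1) for _ in range(dim)]
                        for _ in range(k)] for _ in range(L)]
        self.tables = [{} for _ in range(L)]

    def _key(self, t, v):
        # k-bit key: sign of each projection onto table t's hyperplanes.
        return tuple(sum(p * x for p, x in zip(plane, v)) >= 0
                     for plane in self.planes[t])

    def insert(self, v):
        for t in range(len(self.tables)):
            self.tables[t].setdefault(self._key(t, v), []).append(v)

    def query(self, v):
        # Union of colliding buckets; the caller filters candidates by
        # true distance (the distance computations in the analysis).
        candidates = []
        for t in range(len(self.tables)):
            candidates.extend(self.tables[t].get(self._key(t, v), []))
        return candidates
```

Each query costs L·k projections (the hash-function evaluations), L lookups, and one distance computation per candidate, mirroring the cost breakdown in the abstract.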
The Bloom Clock for Causality Testing
Testing for causality between events in distributed executions is a
fundamental problem. Vector clocks solve this problem but do not scale well.
The probabilistic Bloom clock can determine causality between events with lower
space, time, and message-space overhead than vector clocks; however, its predictions
suffer from false positives. We give the protocol for the Bloom clock based on
Counting Bloom filters and study its properties including the probabilities of
a positive outcome and a false positive. We show the results of extensive
experiments to determine how these probabilities vary as a function of
the Bloom timestamps of the two events being tested, and to determine the
accuracy, precision, and false positive rate of a slice of the execution
containing events in the temporal proximity of each other. Based on these
experiments, we make recommendations for the setting of the Bloom clock
parameters. We postulate the causality spread hypothesis from the application's
perspective to indicate whether Bloom clocks will be suitable for correct
predictions with high confidence. The Bloom clock design can serve as a viable
space-, time-, and message-space-efficient alternative to vector clocks if
false positives can be tolerated by an application.
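A minimal Bloom clock sketch based on a counting Bloom filter, assuming SHA-256-derived cell indices and illustrative sizes (the paper's protocol details may differ):

```python
import hashlib

class BloomClock:
    """Bloom clock: a counting Bloom filter over event IDs. On a local
    event, increment the k cells the event hashes to; on message
    receipt, take the cell-wise max with the sender's clock. Event e1
    is predicted to causally precede e2 when e1's timestamp is
    cell-wise <= e2's; this test can yield false positives."""
    def __init__(self, m=32, k=3):
        self.m, self.k, self.cells = m, k, [0] * m

    def _indices(self, event_id):
        digest = hashlib.sha256(str(event_id).encode()).digest()
        return [digest[i] % self.m for i in range(self.k)]

    def tick(self, event_id):
        for i in self._indices(event_id):
            self.cells[i] += 1

    def merge(self, other):
        self.cells = [max(a, b) for a, b in zip(self.cells, other.cells)]

    def le(self, other):
        """Positive outcome: this timestamp may precede other's."""
        return all(a <= b for a, b in zip(self.cells, other.cells))
```

On a send, the sender ships its clock with the message; the receiver merges and then ticks, so cell-wise dominance tracks potential causality, with false positives arising when unrelated events happen to hash to overlapping cells.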
Functional limit theorems for random regular graphs
Consider d uniformly random permutation matrices on n labels. Consider the
sum of these matrices along with their transposes. The total can be interpreted
as the adjacency matrix of a random regular graph of degree 2d on n vertices.
We consider limit theorems for various combinatorial and analytical properties
of this graph (or the matrix) as n grows to infinity, either when d is kept
fixed or grows slowly with n. In a suitable weak convergence framework, we
prove that the (finite but growing in length) sequences of the number of short
cycles and of cyclically non-backtracking walks converge to distributional
limits. We estimate the total variation distance from the limit using Stein's
method. As an application of these results we derive limits of linear
functionals of the eigenvalues of the adjacency matrix. A key step in this
latter derivation is an extension of the Kahn-Szemerédi argument for
estimating the second largest eigenvalue for all values of d and n.
Comment: Added Remark 27. 39 pages. To appear in Probability Theory and Related Fields
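The construction in the abstract, summing d uniformly random permutation matrices with their transposes, can be sketched directly (pure Python, illustrative):

```python
import random

def permutation_model_adjacency(n, d, seed=0):
    """Adjacency matrix of the permutation model: A = sum_i (P_i + P_i^T)
    for d uniformly random n x n permutation matrices P_i. Every row and
    column of A sums to 2d, i.e. the resulting multigraph is 2d-regular
    (loops and multiple edges are possible)."""
    rng = random.Random(seed)
    A = [[0] * n for _ in range(n)]
    for _ in range(d):
        perm = list(range(n))
        rng.shuffle(perm)        # uniformly random permutation
        for u, v in enumerate(perm):
            A[u][v] += 1         # entry of P_i
            A[v][u] += 1         # entry of P_i^T
    return A
```

The short-cycle counts studied in the paper are combinatorial statistics of exactly this matrix as n grows.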